Speech recognition method and apparatus with noise adaptive standard pattern

ABSTRACT

In a speech recognition method, at least two standard feature vector series having different signal-to-noise ratios (SNRs) are prepared in advance for each category. Then, an SNR is calculated from an input speech signal, and a noise adaptive standard feature vector series is calculated by a linear interpolation method between the prepared standard feature vector series using the calculated SNR. Finally, a nonlinear expansion and reduction matching operation is performed upon the feature vectors of the noise adaptive standard feature vector series and the feature vectors of the input speech signal.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates to a noise adaptive speech recognition method and apparatus.

[0003] 2. Description of the Related Art

[0004] In order to recognize speech in a noisy environment, various speech recognition methods have been suggested.

[0005] A first prior art speech recognition method is based on a noise superposition learning technology which superposes noise data onto a standard speech signal to prepare noise adaptive standard speech data, since the noise data is known to some extent in a noisy environment during speech recognition. Thus, the noise condition of a learning environment is brought close to that of a recognized environment, to improve the performance of speech recognition in a noisy environment (see JP-A-9-198079).

[0006] In the above-described first prior art speech recognition method, however, even when a noisy environment is recognized in advance, the voice level of a speaker, the distance between the speaker and a microphone, the volume gain of an apparatus, the noise level and the like fluctuate on a time basis, so that a signal-to-noise ratio (SNR) also fluctuates on a time basis independent of the recognized noisy environment. Since a correlation between the SNR and the performance of speech recognition is very high, the fluctuation of the SNR would invalidate the noise learning effect of the above-described first prior art speech recognition method.

[0007] In a second prior art speech recognition method, two standard feature vector series having different SNRs such as 0 dB and 40 dB are prepared in advance for each category such as “hakata”. Then, a distance between a segment linking two corresponding standard feature vectors of the vector series and a feature vector of an input speech signal is calculated at each point of a two-dimensional grid formed by the input speech signal. Finally, a minimum value of the accumulated distances for the categories is found and a final category having such a minimum accumulated distance is determined as a recognition result. Therefore, standard feature vectors can be adapted to any value of SNR between 0 dB and 40 dB, so as to obtain high performance speech recognition (see JP-A-10-133688).

[0008] In the above-described second prior art speech recognition method, however, the amount of calculation of the above-mentioned distances based upon three points, i.e., two standard feature vectors and one feature vector of an input speech signal at each grid point is enormous, which would increase the manufacturing cost of a speech recognition apparatus.

[0009] Also, in the above-described second prior art speech recognition method, since individual optimization is carried out at each grid point, when a range of corresponding standard feature vectors of the two feature vector series for one grid point interferes with a range of standard feature vectors for an adjacent grid point, an overmatching or mismatching operation would be carried out. That is, the power of a consonant is relatively small compared with a vowel. Therefore, when the SNR is too small, the power of noise would become equivalent to the power of consonants. For example, the first consonant such as “h” of the category “hakata” is buried in the noise. Thus, the category “hakata” is deformed to a category “akata”. At worst, all the consonants such as “h”, “k” and “t” of the category “hakata” are buried in the noise. Thus, the category “hakata” is deformed to a category “aaa”.

SUMMARY OF THE INVENTION

[0010] It is an object of the present invention to provide a noise adaptive speech recognition method and apparatus capable of decreasing the calculation amount and avoiding the overmatching or mismatching operation.

[0011] According to the present invention, at least two standard feature vector series having different SNRs are prepared in advance for each category. Then, an SNR is calculated from an input speech signal, and a noise adaptive standard feature vector series is calculated by a linear interpolation method between the prepared standard feature vector series using the calculated SNR. Finally, a nonlinear expansion and reduction matching operation is performed upon the feature vectors of the noise adaptive standard feature vector series and the feature vectors of the input speech signal.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] The present invention will be more clearly understood from the description set forth below, with reference to the accompanying drawings, wherein:

[0013] FIG. 1 is a block circuit diagram illustrating a first embodiment of the speech recognition apparatus according to the present invention;

[0014] FIG. 2 is a block circuit diagram illustrating a second embodiment of the speech recognition apparatus according to the present invention;

[0015] FIG. 3 is a diagram for explaining the operation of the preliminary matching unit of FIG. 2; FIG. 4 is a block circuit diagram illustrating a third embodiment of the speech recognition apparatus according to the present invention; and FIG. 5 is a diagram for explaining the operation of the partial matching unit of FIG. 4.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0016] Before the description of the preferred embodiments, the principle of the present invention will be now explained.

[0017] Generally, an input speech signal is a superposition signal of a pure speech signal and a noise signal in a linear spectrum region. Therefore, when speech recognition is carried out in a predictable noisy environment, a noise signal is recorded in advance and is superposed onto standard speaker's speech signals, so that the noise condition of a recognized environment coincides with that of a learning environment. This is called a noise superposition learning method. Note that, as explained above, even when the power spectrum of a noisy environment is recognized in advance, the voice level of a speaker, the distance between the speaker and a microphone, the volume gain of an apparatus and the like fluctuate on a time basis, so that the SNR also fluctuates on a time basis independent of the noise condition of the recognized environment. Therefore, in the noise superposition learning method, an unknown parameter for defining the SNR has to be considered.

[0018] Now, a noise is considered to be superposed onto a pure speech signal, to form a speech signal under noisy conditions. That is,

Y^(L)=S^(L)+α^(L)·N^(L)   (1)

[0019] where Y^(L) is a power spectrum of a speech signal observed in a noisy environment;

[0020] S^(L) is a power spectrum of a pure speech component of the speech signal Y^(L);

[0021] N^(L) is a power spectrum of a noise signal; and

[0022] α^(L) is a gain of the noise signal.

[0023] Note that the gain α allocated to the noise N^(L) can be treated as the SNR.

[0024] In most speech recognition apparatuses, a recognition parameter uses a logarithmic spectrum or its linearly converted parameter such as a cepstrum. Note that the cepstrum is a simple, linear conversion of the logarithmic spectrum. For simplifying the description, formula (1) is converted into a logarithmic spectrum form, i.e., $\begin{aligned} Y &= \log\left(Y^{L}\right) \\ &= \log\left(S^{L} + \alpha^{L} \cdot N^{L}\right) \\ &= \log\left\{\exp(S) + \exp(\alpha + N)\right\} \\ &= f(\alpha, N) \end{aligned} \quad (2)$

[0025] where S=log (S^(L))

[0026] N=log (N^(L))

[0027] α=log (α^(L))
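As an illustration only, formula (2) can be evaluated for a single frequency bin as sketched below. In the logarithmic domain the product α^(L)·N^(L) becomes the sum α+N. The function name `log_superpose`, the use of natural logarithms, and the numeric values are assumptions made for this sketch and do not appear in the specification.

```python
import math

def log_superpose(s, n, alpha):
    """Log spectrum of speech plus scaled noise for one frequency bin.

    s, n and alpha are natural logs of the linear quantities S^(L), N^(L)
    and alpha^(L), so the product alpha^(L) * N^(L) is the sum alpha + n.
    """
    a, b = s, alpha + n
    hi, lo = max(a, b), min(a, b)
    # log(exp(a) + exp(b)) rearranged so that exp() cannot overflow
    return hi + math.log1p(math.exp(lo - hi))

# Hypothetical linear-domain values: speech 4.0, noise 1.5, noise gain 2.0
y = log_superpose(math.log(4.0), math.log(1.5), math.log(2.0))
linear_power = math.exp(y)  # should equal 4.0 + 2.0 * 1.5
```

The exponential and logarithm calls in this sketch are the transcendental operations whose cost motivates the linear approximation that follows.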

[0028] In formula (2), although the speech signal Y can be calculated by a given noise and a given gain α, the calculation of the speech signal Y actually needs a lot of calculation resources due to the presence of transcendental functions such as logarithmic functions and exponential functions. In the present invention, formula (2) is approximated by the linear terms of a Taylor expansion:

[0029] $f(\alpha, N) = f(\alpha_0, N_0) + \frac{\partial f(\alpha_0, N_0)}{\partial \alpha}(\alpha - \alpha_0) + \frac{\partial f(\alpha_0, N_0)}{\partial N}(N - N_0) \quad (3)$

[0030] where f(α₀, N₀) is a standard pattern onto which a reference noise spectrum N₀ with a reference gain α₀ is superposed. If the noise N per se is not changed in this noisy environment, formula (3) is replaced by: $\begin{aligned} f(\alpha, N) &= f(\alpha_0, N_0) + \frac{\partial f(\alpha_0, N_0)}{\partial \alpha}(\alpha - \alpha_0) \\ &= f_0 + (f_1 - f_0) \cdot (SNR - SNR_0)/(SNR_1 - SNR_0) \end{aligned} \quad (4)$

[0031] where f₀(=f(α₀, N₀)) is a standard pattern, i.e., a standard feature vector series at SNR=SNR₀;

[0032] f₁(=f(α₁, N₁)) is a standard pattern, i.e., a standard feature vector series at SNR=SNR₁; and

[0033] SNR is an SNR calculated for an unknown input speech signal.

[0034] Thus, an arbitrary standard pattern, i.e., an arbitrary standard feature vector series is calculated by a linear interpolation method between known standard patterns, i.e., known standard feature vector series using the calculated SNR based upon formula (4).
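The relationship between the exact logarithmic form and the straight-line interpolation between two known patterns can be checked numerically, as in the following sketch for a single bin. It assumes, for illustration only, that the gain α itself serves as the interpolation variable, that S=N=0, and that the two reference gains are −2 and 2. None of these values come from the specification.

```python
import math

def f(alpha, s=0.0, n=0.0):
    # Exact log-domain superposition for one bin: log(exp(S) + exp(alpha + N))
    return math.log(math.exp(s) + math.exp(alpha + n))

def chord(alpha, alpha0, alpha1):
    # Straight line through the two known patterns f(alpha0) and f(alpha1)
    w = (alpha - alpha0) / (alpha1 - alpha0)
    f0, f1 = f(alpha0), f(alpha1)
    return f0 + (f1 - f0) * w

alpha0, alpha1 = -2.0, 2.0             # hypothetical reference gains
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
errors = [chord(a, alpha0, alpha1) - f(a) for a in grid]
```

The interpolation reproduces the exact value at both reference points. Because the exact function is convex in α, the interpolated value between the reference points is never below the exact one.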

[0035] In FIG. 1, which illustrates a first embodiment of the speech recognition apparatus according to the present invention, an SNR multiplexed standard pattern storing unit 1 stores two standard patterns per one category, i.e., two standard feature vector series per one category onto which different noises associated with SNR₀ and SNR₁ are superposed. In this case, if the standard patterns are denoted by f₀ and f₁, respectively, f₀ and f₁ are feature vector series represented by:

f₀={μ₁, μ₂, . . . , μ_(m)}  (5)

f₁={μ̂₁, μ̂₂, . . . , μ̂_(m)}  (6)

[0036] where μ₁, μ₂, . . . , μ_(m), μ̂₁, μ̂₂, . . . , μ̂_(m) are feature vectors on a time basis which are represented by:

μ_(i)=(μ_(i1), μ_(i2), . . . , μ_(in))   (7)

μ̂_(i)=(μ̂_(i1), μ̂_(i2), . . . , μ̂_(in))   (8)

[0037] Note that feature vectors are obtained by analyzing waveforms of a speech signal at every predetermined short time period using a spectrum analysis output, a filter bank output, a cepstrum or power, and arranging multi-dimensional vector results on a time basis.

[0038] A feature extracting unit 2 receives an input speech signal S_(in) to analyze the waveforms thereof, thus calculating a feature pattern, i.e., a feature vector series denoted by Y:

Y={Y₁, Y₂, . . . , Y_(m)}  (9)

[0039] where

Y_(i)={Y_(i1), Y_(i2) , . . . , Y_(in)}  (10)

[0040] An input SNR calculating unit 3 receives the input speech signal S_(in) to calculate an SNR thereof. The SNR of the input signal S_(in) is represented by:

SNR=10 log(P_(s))−10 log(P_(n))   (11)

[0041] where P_(s) is an average power of a speech component of the input speech signal S_(in); and

[0042] P_(n) is an average power of a noise component of the input speech signal S_(in). Generally, the SNR of the input speech signal S_(in) is approximated by: $SNR = 10\log\left(\frac{P_s}{P_n}\right) = 10\log\left(\frac{\overline{\sum_{t \in T_s} x^{2}(t)}}{\overline{\sum_{t \in T_n} x^{2}(t)}}\right) \quad (12)$

[0043] where x(t) is a power of the input speech signal S_(in) at time t;

[0044] T_(s) is a set of speech time periods; and

[0045] T_(n) is a set of noise time periods. For example, a short time period is determined to be a speech time period or a noise time period by whether or not the power of the input speech signal S_(in) during the short time period is higher than a threshold value.
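A minimal sketch of the threshold-based estimate of formula (12) is given below. It assumes that a power value is already available for each short time period, that the logarithm is taken to base 10 so that the result is in dB, and that a single fixed threshold separates speech periods from noise periods. The name `estimate_snr_db` and the sample values are hypothetical.

```python
import math

def estimate_snr_db(frame_powers, threshold):
    """SNR in dB from per-period powers; periods above threshold count as speech."""
    speech = [p for p in frame_powers if p > threshold]
    noise = [p for p in frame_powers if p <= threshold]
    if not speech or not noise:
        raise ValueError("need at least one speech period and one noise period")
    p_s = sum(speech) / len(speech)   # average power over the speech periods
    p_n = sum(noise) / len(noise)     # average power over the noise periods
    if p_n <= 0.0:
        raise ValueError("noise power must be positive")
    return 10.0 * math.log10(p_s / p_n)

# Hypothetical powers: two noise periods, three speech periods, one noise period
powers = [0.01, 0.01, 1.0, 1.0, 1.0, 0.01]
snr = estimate_snr_db(powers, threshold=0.1)
```

With these sample values the speech periods carry one hundred times the average noise power, which corresponds to 20 dB.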

[0046] A speech segmentation unit (not shown) for detecting speech segmentation points in the input speech signal S_(in) is connected to the feature extracting unit 2 and the input SNR calculating unit 3. However, since the speech segmentation unit has no relationship to the present invention, the detailed description thereof is omitted.

[0047] A noise adaptive standard pattern generating unit 4 receives the standard patterns f₀ and f₁ (see formulae (5), (6), (7) and (8)) from the SNR multiplexed standard pattern storing unit 1 and the SNR from the input SNR calculating unit 3 to generate a noise adaptive standard pattern, i.e., a noise adaptive feature vector series f₂ by using formula (4):

f₂=f₀+(f₁−f₀)·(SNR−SNR₀)/(SNR₁−SNR₀)   (13)

[0048] Then, the noise adaptive standard pattern f₂ is stored in a noise adaptive standard pattern storing unit 5.
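The calculation of formula (13) can be sketched as an element-by-element interpolation over the two stored feature vector series. The sketch assumes that both series have the same number of vectors and the same dimension. The function name and the numeric patterns are hypothetical.

```python
def interpolate_pattern(f0, f1, snr, snr0, snr1):
    """Apply formula (13) to every element of two equal-length vector series."""
    if snr0 == snr1:
        raise ValueError("snr0 and snr1 must differ")
    w = (snr - snr0) / (snr1 - snr0)
    return [[a + (b - a) * w for a, b in zip(v0, v1)]
            for v0, v1 in zip(f0, f1)]

f0 = [[0.0, 2.0], [4.0, 6.0]]   # hypothetical pattern stored for SNR0 = 0 dB
f1 = [[1.0, 4.0], [8.0, 6.0]]   # hypothetical pattern stored for SNR1 = 40 dB
f2 = interpolate_pattern(f0, f1, snr=10.0, snr0=0.0, snr1=40.0)
```

Only one multiplication and two additions are needed per element, and no logarithm or exponential is evaluated at recognition time.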

[0049] A matching unit 6 performs a nonlinear expansion and reduction matching operation such as a Viterbi matching operation using a Hidden Markov Model (HMM) or a dynamic program (DP) matching operation upon the feature vector series Y of the input speech signal S_(in) and the standard pattern (feature vector series) f₂. As a result, a final category is determined and outputted as an output signal S_(out).
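One possible form of the DP matching operation is sketched below. The local distance (squared Euclidean), the symmetric step pattern without slope constraints, and the one-dimensional sample patterns are assumptions chosen for brevity. The specification does not fix these details, and an HMM-based Viterbi matching operation would differ.

```python
def dp_distance(inp, ref):
    """Accumulated squared distance along the best DP alignment of two series."""
    inf = float("inf")
    rows, cols = len(inp), len(ref)
    d = [[inf] * (cols + 1) for _ in range(rows + 1)]
    d[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = sum((x - y) ** 2 for x, y in zip(inp[i - 1], ref[j - 1]))
            # Nonlinear expansion and reduction: stay, advance, or skip in time
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[rows][cols]

def recognize(inp, patterns):
    """Return the category whose pattern gives the minimum accumulated distance."""
    return min(patterns, key=lambda name: dp_distance(inp, patterns[name]))

# Hypothetical noise adaptive patterns f2 for two categories
patterns = {"hakata": [[0.0], [1.0], [2.0]], "akata": [[5.0], [6.0], [7.0]]}
result = recognize([[0.0], [0.0], [1.0], [2.0]], patterns)
```

Here the input has one more frame than the pattern for "hakata", and the alignment absorbs the repeated first frame at no cost.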

[0050] In the speech recognition apparatus of FIG. 1, note that there are actually a plurality of categories. Also, although only two standard patterns are prepared for each category, three or more standard patterns can be prepared. In this case, two of the standard patterns having SNR values closer to the SNR calculated by the input SNR calculating unit 3 are selected, and a noise adaptive standard pattern f₂ is calculated by the noise adaptive standard pattern generating unit 4 in accordance with the selected two standard patterns.
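The selection of two standard patterns out of three or more can be sketched as follows. The function name `select_two` and the stored SNR levels are hypothetical, and the sketch takes "closer" to mean the smallest absolute difference from the calculated SNR.

```python
def select_two(snr_levels, snr):
    """Pick the two stored SNR levels nearest to the calculated SNR, ascending."""
    if len(snr_levels) < 2:
        raise ValueError("need at least two standard patterns")
    nearest = sorted(snr_levels, key=lambda level: abs(level - snr))[:2]
    return tuple(sorted(nearest))

# Hypothetical stored levels of 0 dB, 20 dB and 40 dB with a measured 27 dB
pair = select_two([0.0, 20.0, 40.0], snr=27.0)
```

When the calculated SNR lies outside the stored range, the two nearest levels are both on the same side, so the interpolation of formula (13) becomes an extrapolation.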

[0051] In the speech recognition apparatus of FIG. 1, since the noise adaptive standard pattern generating unit 4 and the matching unit 6 are operated in accordance with the SNR calculated by the input SNR calculating unit 3, a large delay time may occur, which is not practical. Note that, in most speech recognition apparatuses, a frame synchronization is carried out, so that parts of an input speech signal are sequentially treated, which decreases the delay time.

[0052] In order to decrease the delay time of the speech recognition apparatus of FIG. 1, one approach is that the input SNR calculating unit 3 calculates an SNR in accordance with a part of the input speech signal S_(in). For example, use is made of a noise component before the generation of the input speech signal S_(in) and a first part of the input speech signal S_(in). Another approach is that the noise adaptive standard pattern generating unit 4 calculates a noise adaptive standard pattern in accordance with a past value of SNR previously calculated by the input SNR calculating unit 3. That is, use is made of the previous input speech signal S_(in). Also, the past value of SNR can be replaced by an average value of the past values of SNR. In this case, a relatively large delay time may occur only for the initial input speech signal S_(in).

[0053] In FIG. 2, which illustrates a second embodiment of the speech recognition apparatus according to the present invention, a preliminary matching unit 7 is provided instead of the input SNR calculating unit 3 of FIG. 1.
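The use of an average of past SNR values can be sketched with a small running-mean holder. The class name `SnrHistory` and the sample values are hypothetical, and an unweighted mean over all earlier input speech signals is assumed.

```python
class SnrHistory:
    """Keeps a running mean of SNR values measured on earlier input signals."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def current(self):
        # None until the first input speech signal has been measured
        return self.mean

    def update(self, snr):
        self.count += 1
        if self.mean is None:
            self.mean = snr
        else:
            self.mean += (snr - self.mean) / self.count

history = SnrHistory()
before_first = history.current()
for measured in (10.0, 20.0, 30.0):   # hypothetical SNR values in dB
    history.update(measured)
```

The value returned before the first update is empty, which corresponds to the initial input speech signal for which no past SNR exists.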

[0054] The preliminary matching unit 7 calculates an SNR in accordance with a matching operation such as a Viterbi matching operation or a DP matching operation.

[0055] The operation of the preliminary matching unit 7 will be explained next with reference to FIG. 3 which shows a correct candidate path for standard patterns f₀ and f₁ in an HMM form.

[0056] The standard patterns f₀ and f₁ with different SNRs, SNR₀ and SNR₁, are represented by (see formulae (5), (6), (7) and (8)):

μ_(ij) and μ̂_(ij)

[0057] Also, the feature vectors of the input speech signal S_(in) are represented by (see formulae (9) and (10)):

Y_(ij)

[0058] If a deviation of Y_(ij) is represented by δ²_(ij), the accumulated distance O(α) between μ_(ij) (μ̂_(ij)) and Y_(ij) can be represented by $O(\alpha) = \sum_{i}\sum_{j}\frac{\left\{Y_{ij} - \alpha\mu_{ij} - (1 - \alpha)\hat{\mu}_{ij}\right\}^{2}}{\delta_{ij}^{2}} \quad (14)$

[0059] where α is a parameter for an unknown SNR (α=SNR). When the accumulated distance O(α) is minimum,

dO(α)/dα=0

[0060] Therefore, $\alpha = \frac{\sum_{i}\sum_{j}\left(\hat{\mu}_{ij} - \mu_{ij}\right)\left(\hat{\mu}_{ij} - Y_{ij}\right)/\delta_{ij}^{2}}{\sum_{i}\sum_{j}\left(\hat{\mu}_{ij} - \mu_{ij}\right)\left(\hat{\mu}_{ij} - \mu_{ij}\right)/\delta_{ij}^{2}} \quad (15)$

[0061] Thus, the preliminary matching unit 7 calculates SNR (≡α) in accordance with formula (15). Then, the noise adaptive standard pattern generating unit 4 calculates a noise adaptive standard pattern f₂ in accordance with formula (13). In this case, if α is used instead of SNR,

f₂=α·f₀+(1−α)·f₁  (16)
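As a sketch under stated assumptions, the value of α that minimizes the accumulated distance of formula (14) can be computed in closed form once the input frames have been aligned with the standard patterns. The sketch assumes that the alignment along the candidate path has already been carried out, so that all series have the same length. It forms the adapted pattern as the weighted sum α·μ+(1−α)·μ̂ that appears inside formula (14). All names and numbers are hypothetical.

```python
def estimate_alpha(y, mu, mu_hat, var):
    """Least-squares alpha minimizing the accumulated distance over aligned frames."""
    num = den = 0.0
    for y_i, m_i, mh_i, v_i in zip(y, mu, mu_hat, var):
        for y_ij, m_ij, mh_ij, v_ij in zip(y_i, m_i, mh_i, v_i):
            diff = mh_ij - m_ij
            num += diff * (mh_ij - y_ij) / v_ij
            den += diff * diff / v_ij
    if den == 0.0:
        raise ValueError("the two standard patterns are identical")
    return num / den

def adapt(mu, mu_hat, alpha):
    """Adapted pattern as the weighted sum alpha * mu + (1 - alpha) * mu_hat."""
    return [[alpha * a + (1.0 - alpha) * b for a, b in zip(m, mh)]
            for m, mh in zip(mu, mu_hat)]

mu = [[0.0, 2.0], [4.0, 6.0]]        # hypothetical pattern f0
mu_hat = [[1.0, 4.0], [8.0, 6.0]]    # hypothetical pattern f1
var = [[1.0, 1.0], [1.0, 1.0]]       # hypothetical deviations
y = adapt(mu, mu_hat, 0.25)          # synthetic input built with alpha = 0.25
alpha = estimate_alpha(y, mu, mu_hat, var)
```

Because the synthetic input is built with a known weight, the estimate can be compared against that weight directly.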

[0062] In FIG. 4, which illustrates a third embodiment of the speech recognition apparatus according to the present invention, a partial matching unit 8 is provided instead of the preliminary matching unit 7 of FIG. 2.

[0063] The partial matching unit 8 calculates an SNR in accordance with a matching operation such as a Viterbi matching operation or a DP matching operation.

[0064] The operation of the partial matching unit 8 will be explained next with reference to FIG. 5 which shows a partial candidate path for standard patterns f₀ and f₁ in an HMM form. That is, at time t=t′, a partial candidate path is obtained. Then, the noise adaptive standard pattern generating unit 4 renews a noise adaptive standard pattern f₂ in accordance with the partial candidate path by using formula (13) or (16). Also, at time t=t′, the matching unit 6 performs a matching operation upon the input speech signal S_(in) and the renewed noise adaptive standard pattern f₂. Finally, at time t=T, the operations of the partial matching unit 8 and the matching unit 6 are completed and a final category is output as the output signal S_(out).
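The renewal of the intermediate value at each time t′ can be sketched by accumulating the sums of the least-squares estimate frame by frame, so that an intermediate α is available before the whole input has arrived. The sketch assumes that each incoming frame has already been paired with its standard feature vectors along the partial candidate path. The class name `PartialAlpha` and the default value returned before any frame is seen are hypothetical.

```python
class PartialAlpha:
    """Accumulates the sums of the alpha estimate one aligned frame at a time."""

    def __init__(self):
        self.num = 0.0
        self.den = 0.0

    def add_frame(self, y_i, mu_i, mu_hat_i, var_i):
        for y, m, mh, v in zip(y_i, mu_i, mu_hat_i, var_i):
            diff = mh - m
            self.num += diff * (mh - y) / v
            self.den += diff * diff / v

    def alpha(self, default=0.5):
        # default is an arbitrary fallback used before any informative frame
        return self.num / self.den if self.den > 0.0 else default

partial = PartialAlpha()
initial = partial.alpha()
partial.add_frame([0.75, 3.5], [0.0, 2.0], [1.0, 4.0], [1.0, 1.0])
after_first = partial.alpha()
partial.add_frame([7.0, 6.0], [4.0, 6.0], [8.0, 6.0], [1.0, 1.0])
after_second = partial.alpha()
```

Each call to `add_frame` costs a few operations per vector element, so the intermediate value can be renewed at every frame if desired.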

[0065] As explained hereinabove, according to the present invention, since a noise adaptive SNR value (≡α) is determined in advance, the amount of calculation can be decreased, thus decreasing the manufacturing cost of a speech recognition apparatus. Also, since a noise adaptive SNR value is obtained by using all or part of a speech signal, the overmatching or mismatching operation can be avoided.

1. A speech recognition method for recognizing an input speech signal for a plurality of categories, comprising the steps of: storing at least two standard feature vector series having different signal-to-noise ratios (SNRs) for each of said categories; calculating an SNR of said input speech signal; calculating a noise adaptive standard feature vector series by a linear interpolation method between said standard feature vector series using said calculated SNR; and performing a nonlinear expansion and reduction matching operation upon feature vectors of said noise adaptive standard feature vector series and feature vectors of said input speech signal.
 2. The method as set forth in claim 1, wherein said noise adaptive standard feature vector series calculating step calculates said noise adaptive standard feature vector series by f₂=f₀+(f₁−f₀)·(SNR−SNR₀)/(SNR₁−SNR₀) where f₂ is said noise adaptive standard feature vector series; f₀ and f₁ are said standard feature vector series; SNR₀ and SNR₁ are SNRs of said standard feature vector series f₀ and f₁, respectively; and SNR is said calculated SNR of said input speech signal.
 3. The method as set forth in claim 1, wherein said SNR calculating step calculates said SNR of said input speech signal by a ratio of a speech component of said speech signal to a noise component before generation of said speech signal.
 4. The method as set forth in claim 1, wherein said noise adaptive standard feature vector series calculating step uses an SNR calculated for a previous input speech signal of said input speech signal as said calculated SNR.
 5. A speech recognition method for recognizing an input speech signal for a plurality of categories, comprising the steps of: storing at least two standard feature vector series having different signal-to-noise ratios (SNRs) for each of said categories; performing a preliminary matching operation upon feature vectors of said standard feature vector series and feature vectors of said input speech signal to obtain a correct candidate path, and calculating an SNR of said input speech signal so that accumulated distance along said correct candidate path is minimum; calculating a noise adaptive standard feature vector series by a linear interpolation method between said standard feature vector series using said calculated SNR; and performing a nonlinear expansion and reduction matching operation upon the feature vectors of said noise adaptive standard feature vector series and feature vectors of said input speech signal.
 6. A speech recognition method for recognizing an input speech signal for a plurality of categories, comprising: storing at least two standard feature vector series having different signal-to-noise ratios (SNRs) for each of said categories; performing a partial matching operation upon feature vectors of said standard feature vector series and feature vectors of said input speech signal to obtain a partial candidate path, and calculating an intermediate SNR of said input speech signal so that accumulated distance along said partial candidate path is minimum; calculating a noise adaptive standard feature vector series by a linear interpolation method between said standard feature vector series using said calculated intermediate SNR; and performing a nonlinear expansion and reduction matching operation upon the feature vectors of said noise adaptive standard feature vector series and feature vectors of said input speech signal.
 7. A speech recognition apparatus for recognizing an input speech signal for a plurality of categories, comprising: a multiplexed standard pattern storing unit for storing at least two standard feature vector series having different signal-to-noise ratios (SNRs) for each of said categories; an input SNR calculating unit for calculating an SNR of said input speech signal; a noise adaptive standard pattern calculating unit for calculating a noise adaptive standard feature vector series by a linear interpolation method between said standard feature vector series using said calculated SNR; and a matching unit for performing a nonlinear expansion and reduction matching operation upon feature vectors of said noise adaptive standard feature vector series and feature vectors of said input speech signal.
 8. The apparatus as set forth in claim 7, wherein said noise adaptive standard pattern calculating unit calculates said noise adaptive standard feature vector series by f₂=f₀+(f₁−f₀)·(SNR−SNR₀)/(SNR₁−SNR₀) where f₂ is said noise adaptive standard feature vector series; f₀ and f₁ are said standard feature vector series; SNR₀ and SNR₁ are SNRs of said standard feature vector series f₀ and f₁, respectively; and SNR is said calculated SNR of said input speech signal.
 9. The apparatus as set forth in claim 7, wherein said input SNR calculating unit calculates said SNR of said input speech signal by a ratio of a speech component of said speech signal to a noise component before generation of said speech signal.
 10. The apparatus as set forth in claim 7, wherein said noise adaptive standard pattern calculating unit uses an SNR calculated for a previous input speech signal of said input speech signal as said calculated SNR.
 11. A speech recognition apparatus for recognizing an input speech signal for a plurality of categories, comprising: a multiplexed standard pattern storing unit for storing at least two standard feature vector series having different signal-to-noise ratios (SNRs) for each of said categories; a preliminary matching unit for performing a preliminary matching operation upon feature vectors of said standard feature vector series and feature vectors of said input speech signal to obtain a correct candidate path, and calculating an SNR of said input speech signal so that accumulated distance along said correct candidate path is minimum; a noise adaptive standard pattern calculating unit for calculating a noise adaptive standard feature vector series by a linear interpolation method between said standard feature vector series using said calculated SNR; and a matching unit for performing a nonlinear expansion and reduction matching operation upon the feature vectors of said noise adaptive standard feature vector series and feature vectors of said input speech signal.
 12. A speech recognition apparatus for recognizing an input speech signal for a plurality of categories, comprising: a multiplexed standard pattern storing unit for storing at least two standard feature vector series having different signal-to-noise ratios (SNRs) for each of said categories; a partial matching unit for performing a partial matching operation upon feature vectors of said standard feature vector series and feature vectors of said input speech signal to obtain a partial candidate path, and calculating an intermediate SNR of said input speech signal so that accumulated distance along said partial candidate path is minimum; a noise adaptive standard pattern calculating unit for calculating a noise adaptive standard feature vector series by a linear interpolation method between said standard feature vector series using said calculated intermediate SNR; and a matching unit for performing a nonlinear expansion and reduction matching operation upon the feature vectors of said noise adaptive standard feature vector series and feature vectors of said input speech signal.